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Abstract 

Tensors naturally model many real world processes which 
generate multi-aspect data. Such processes appear in 
many different research disciplines, e.g, chemometrics, 
computer vision, psychometrics and neuroimaging anal- 
ysis. Tensor decompositions such as the Tucker decom- 
position are used to analyze multi-aspect data and extract 
latent factors, which capture the multilinear data structure. 
Such decompositions are powerful mining tools, for ex- 
tracting patterns from large data volumes. However, most 
frequently used algorithms for such decompositions in- 
volve the computationally expensive Singular Value De- 
composition. 

In this paper we propose MACH, a new sampling al- 
gorithm to compute such decompositions. Our method 
is of significant practical value for tensor streams, such 
as environmental monitoring systems, IP traffic matrices 
over time, where large amounts of data are accumulated 
and the analysis is computationally intensive but also in 
"post-mortem" data analysis cases where the tensor does 
not fit in the available memory. We provide the theoretical 
analysis of our proposed method, and verify its efficacy in 
monitoring system applications. 

Categories and Subject Descriptors: 
General Terms: Algorithms; Experimentation. 
Keywords: Tensors; Tucker Decompositions; SVD 

1 Introduction 

Numerous real-world problems involve multiple aspect 
data. For example fMRI (functional magnetic resonance 
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imaging) scans, one of the most popular neuroimaging 
techniques, result in multi-aspect data: voxels x subjects 
x trials x task conditions x timeticks. Monitoring sys- 
tems result in three-way data, machine id x type of mea- 
surement x timeticks. The machine depending on the set- 
ting can be for instance a sensor (sensor networks) or a 
computer (computer networks). Large data volumes gen- 
erated by personalized web search, are frequently mod- 
eled as three way tensors, i.e., users x queries x web 
pages. 

Ignoring the multi-aspect nature of the data by flat- 
tening them in a two-way matrix and applying an ex- 
ploratory analysis algorithm, e.g., singular value decom- 
position (SVD) ( 11221 ). is not optimal and typically hurts 
significantly the performance (e.g., |51 1). The same holds 
in the case of applying e.g., SVD on different 2-way slices 
of the tensor as observed by 1281 . On the contrary, mul- 
tiway data analysis techniques succeed in capturing the 
multilinear structures in the data, thus achieving better 
performance than the aforementioned ideas. 

Tensor decompositions have found the last years many 
applications in different scientific disciplines. Indica- 
tively, computer vision and signal processing (e.g., BTl 
[35ll43ll ). neuroscience (e.g., f5)), time series anomaly de- 
tection (e.g., |47l ), psychometrics (e.g., ||49ll ), chemomet- 
rics (e.g., l44l ). graph analysis (e.g., ||25ll45l ), data min- 
ing (e.g., l48l ). Two recent surveys of tensor decomposi- 
tions and their applications are |26l .l2l. with a wealth of 
references on the topic. 

Two broad families of decompositions are used in the 
multiway analysis, each with its own characteristics: the 
canonical decomposition (parallel factor analysis), a.k.a. 
CANDECOMP (PARAFAC) J6l[l9l,and the Tucker fam- 
ily of decompositions (49). In this paper, we focus on 
the latter. The Tucker decomposition can be thought of 
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as the generalization of the Singular Value Decomposi- 
tions (SVD) to the multiway case. Even if there exist al- 
gorithms which cast the Tucker decomposition as a non- 
linear optimization problem (e.g., Bill . |fl~)), currently in 
practice the approach followed is the Alternating Least 
Squares, which involves the computationally expensive 
SVD. To speed up tensor decompositions, randomized 
algorithms |[T4l [34l have appeared in the recent years. 
This family of randomized algorithms are generalizations 
of fast low rank approximation methods iTTTl [33l [LP , 
adapted appropriately to the multiway case. 

In this paper we propose a simple randomized algo- 
rithm that speedups significantly the Tucker decomposi- 
tion while at the same time results with guarantees in an 
accurate estimate of the tensor decomposition. MACH, 
the proposed method, can be applied both to "post- 
mortem" data analysis and to tensor streams to perform 
data mining tasks such as network anomaly detection, and 
in general the set of mining tasks which rely on the study 
of a low rank Tucker approximation. MACH is useful 
when the data does not fit into the available memory and 
also in tensor streams, such as computer monitoring sys- 
tems, which is also the main motivation behind this work. 
Specifically, one of the monitoring systems of Carnegie 
Mellon University, monitors and uses data mining tech- 
niques to detect failures. Currently, it monitors over 100 
hosts in a prototype data center at CMU. It uses the SNMP 
protocol and it stores the monitoring data in an mySQL 
database. Mining anomalies in this system is performed 
using the SPIRIT method and its extension in the multi- 
way case, i.e., the two heads method which uses a Tucker 
decomposition and treats the time aspect using wavelets 
ll37l |2T1 l47l . Applying the aforementioned methods on 
large volumes of data is a challenge. 

It is worth outlining at this point that in many data min- 
ing applications preserving a constant number of principal 
components almost the same is of high practical value: 
a low rank approximation typically captures a significant 
proportion of the variance in many real world processes 
and outliers can be detected by examining their position 
relative to the subspace spanned by the PCs. 

It is also worth noting that despite many cases where 
the formulated tensor is sparse, i.e., few non zero elements 
as observed in J27), there exist real world problems where 
the tensor is dense. As table[T]shows, for both monitoring 
system we use in the experimental section HI the result- 



ing tensors are very dense. This is the typical case in a 
monitoring system, since at timetick k we receive a mea- 
surement of type j for machine i, resulting in a non zero 
in (i,j,k). 



name 


Percentage of non-zeros 


Sensor 


85 % 


Network Data ifTUI 




Computer 


81% 


Network Data ( \\2l\\ ) 





Table 1: Tensors from monitoring system are typically 
dense. 

The main contributions of this paper are summarized as 
in the following: 

• MACH, a randomized algorithm to compute the 
Tucker decomposition of a tensor X. MACH is 
embarrassingly parallel, and adapts easily to tensor 
streams. 

• The following theorem, which is our main theoreti- 
cal result: 

Theorem 1 Let X e R*ix*»<...x/ d a d-mode ten- 



sor. Let I n > 76, ll < n*=i I 3 f°r 
andb = max; , . \Xi lt ...,i d \. 

For p > max,- — /Tld ? 



1, 



let X e 



3 (Y\ d I, 

■^iixi 2 x...xi d fo £ a i ensor wnose entries are inde- 
pendently distributed as: Xi^...^ = ll " p '' ld with 
probability p, otherwise 0. 

Let X be the (r\ , . . . , Td)-rank approximation of X 
given by its HOSVD : 

(1) X = X x x A^A^ T x 2 ...x d AWAW T 

where A^ m ^ is a I, n x r m matrix containing the r m 
top left singular vectors of the matricization of X 
along the m-th mode. 

Let Xu\ ri ,Xu\ r . denote the rank ri approxima- 
tion of the matricizations X^, Xn\ of tensors X ', X 
along mode i respectively. Then with probability at 

least rii=i( 1 - ea; P(- 19 Efc=i,fc/j log 4)) the fol- 
lowing holds: 
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(2) 



\X -X\\<t 



where t is given by the following equation: 
t = miiij =1 ...dti where 

U = I - X (i) , ri 1 1 + 46(a li;; /„)*- 

4(||X (i)i „||6)l(^ntw^i) i + 
X, 



Ed 



I \ x {j) 



• Experiments on monitoring systems, where we 
demonstrate the success of our proposed algorithm. 

The outline of the paper is the following: in Section [2] 
we briefly present the necessary theoretical background, 
in Section|3]we describe and analyze the proposed method 
and in Section |4] we present the experimental results. We 
conclude in Section|5] 



2 Background 

In this section we briefly present the background behind 
tensors and low rank approximations. Table [2] shows the 
symbols and the abbreviations we use and their explana- 
tion. 



2.1 Tensors 

Historical Remarks Tensors traditionally have been 
used in physics (e.g., stress and strain tensors). After Ein- 
stein presented the theory of general relativity tensor anal- 
ysis became popular. Certain ideas on multi-way analysis 
data back in 1944 and 1952 and are due to Raymond Cat- 
tell ll38l[39l . Tucker introduced tensor analysis in psycho- 
metrics [49] (Tucker family). Harshman |[T9l and Car- 
rol and Chang |_6] independently proposed the canoni- 
cal decomposition of a tensor (CANDECOMP family). 
These two families of decompositions come with different 
names, see IT261 . The difference between them is visual- 
ized for a three way tensor in figure |2~T1 In the following 
we will focus on Tucker decompositions. 



Symbol 


Definition and Description 


d 


number of modes 


h 


dimensionality of 




j-th mode 


x,y,. . . e R-'ix-x-'d 


d-mode 




tensor (calligraphic) 


X 


tensor obtained upon 




applying MACH on X 


A,U,...£ R mx ™ 


matrices (upper case) 




scalars (lower case) 




matricization of X, X 




along mode i 




r.i rank approximation 




of the matricizations 






x„ 


mode-n product 


HOOI 


Higher Order Orthogonal 




Iteration ||30l 


HOSVD 


Higher Order Singular 




Value Decomposition [8] 



Table 2: Symbols 



Tensor Concepts Let X e 



dI 1 X J 2 X ■■■ Xld 



be a multi- 



way array. We will call X a tensor, i.e., we will use the 
terms multiway array and tensor interchangeably. The or- 
der of a tensor is the number of dimensions, also known 
as ways, modes or aspects and is equal to d for tensor X. 
The dimensionality of the j-th mode is equal to Ij. 

The norm of tensor X is defined to be the square root 
of the sum of all entries of the tensor squared, i.e., 



(3) 



\X\ 



\ ii=ij2=i 



As we see the norm of a tensor is the straight-forward gen- 
eralization of the Frobenius norm of a matrix (2 modes) 
to TV modes. 

The inner product of two tensors with the same num- 
ber of modes and equal dimensionality per mode, X, y £ 
M? 1 x 12 x ■ ■ ■ x Id , is defined by the following equation: 

h 



EE 



3 



TUCKER3 




'IP 



CANDECOMP/PARAFAC 





+ + 



X = ii 

"(II 1 



l 2 l 3 



Figure 1: CANDECOMP/PARAFAC and Tucker tensor 
decompositions. 



Figure 2: Matricization of a three-way I\ x I2 x I3, 1% = 3, 
tensor along the first mode. The three slices are denoted 
with different color. 



Observe that equation [3] can equivalently be written as 
1 1 A? 1 1 = \J {X, X) A tensor fiber (slice) is a one (two)- 
dimensional fragment of a tensor, obtained by fixing all 
indices but one (two). For more details on tensor fibers 
and slices, see 11261 . 

Matricization along mode k, results in a 4 x 
(=1 j, k Ij matrix. The . . . , id) element is mapped 

1 + Eg=i,^ fc («9 - 1)^<? where 
Figure 12.11 shows the concept of 



n 

to (ik,j) where j 



Y[m=l,m^k I" 



matricization for a three-way tensor. The operation of 
matricization naturally introduces the concept of a vec- 
tor containing ranks (n, . . . , r<j): r, is equal to the rank 
of the X^, the matrix resulting by the matricization of 
the tensor X along the i-th mode. 

The n-mode product of X with a matrix M G M' /x/ " 
is denoted by X x „ M and is a tensor of size I\ x I 2 x 
. . . I n -\ x J x I n+ i x . . .Id- Specifically, 



(5) (ATXnAQfc. 



Some important facts concerning n-mode products, is the 
following: 



(6) 



X x TO A x„ B — X x n B x m A,rn ^ n 



The importance of this equation lies in the fact that the 
order of execution of the tensor matrix products does not 
play any role, as long as the multiplications are along dif- 
ferent modes. When we multiply a tensor and two matri- 
ces along the same mode the following equation holds: 



(7) 



X x m A x m B = X x m {BA) 



Furthermore, if UU T = I then the following equation 
holds: 



(8) 



||Ax n C/|| = PII 



The rank R of the d-way tensor X is the minimum num- 
ber of d-linear components to fit X exactly, i.e.,: 



(9) 



X = 



R 

E 

m— 1 



c (l) o c (2) 



(i) 

where c\ , 



are the R components for the 7-th 
mode and o denotes the tensor product. Even if the above 
generalization is a straightforward generalization of the 
rank of a matrix, the concept of the tensor rank is special. 
For example, for a matrix A € 
R c and the row rank R r are equal R 



2x2 the column rank 
R r = r to the 
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matrix rank r. Furthermore, r < 2. However for a tensor 
X e R2X2X2 me rank can be 2 or 3 J29l. Therefore the 
word rank can have different meanings: a) The individual 
rank, i.e., for a specific instance of a tensor what is BP. b) 
The typical rank is the rank that we almost surely observe. 
For example for 2x2x2 tensors the typical rank is {2, 3}. 
c) Vector of ranks (n, . . . , 7*d). The value of r, is equal to 
the rank of the matricized version Xu\ of the tensor. 

Consider figure |2~T1 which depicts a three mode tensor 
X e R^x^x^. The PARAFAC/CANDECOMP model 
is given by equation [TOj whereas the Tucker model is 
given by equation QT] 

R 

(10) Xijk = airbjrCqrXr + &ijk 

r=l 



Tucker3 Algorithms The algorithm which should be 
used to compute the Tucker3 decomposition of a tensor 
depends on whether or not the data is noise free. In 
the former case, an exact, closed form solution exists, 
whereas in the latter case the alternating least squares al- 
gorithm (ALS) is frequently used. However, it is worth 
noting that even in cases where there is noise in the data, 
the closed form solution a.k.a. as HOSVD 11261 [HI is satis- 
factory in practice ll32l . 

Let X e M /x xLl x/ 3 and (n , r 2 , r 3 ) the vector contain- 
ing the desired approximation ranks along each mode. In 
the case of noise-free data, the algorithm matricizes the 
tensor along each mode and computes the Tk top left sin- 
gular vectors k = 1,2,3. Let Ak be the Ik x matrix 
containing in its columns those vectors. The core tensor 
is computed with the following equation: 



P Q R 

(11) X{jk — ^ ^ ^ ^ &ipbjqCkr9pqr &ijk 

p—1 q—1 r — 1 

Few brief remarks on the above two models: a) In 
terms of the fit, the Tucker family is at least as good 
as the PARAFAC/CANDECOMP since as we see from 
the above equations, the PARAFAC model can be viewed 
as a restrictive Tucker model, where the core tensor Q 
is superdiagonal, i.e., g pqr ^ only if p = q = r. 
However, it is worth noting that better fit is not nec- 
essarily optimal (see |44| , Ch.7) b) The Tucker model 
does not result in unique solutions since it has rotational 
freedom. Typically one chooses a solution that satis- 
fies a certain criterion, as the all-orthogonality core ten- 
sor: (G(m, :, :), G(n, :, :)) = (G(:, m, :), G(:, n, :)) = 
(<?(:, :,m),G(:,:,7i)) = when m ^ n (10). c) Ba- 
sic concepts as the uniqueness of the canonical tensor de- 
composition, degeneracy of the rank, border rank are not 
discussed. A good reference is fl26l and the related refer- 
ences therein. 

In the following we focus on the Tucker family. Com- 
pressing n out of the d modes of a tensor results in 
a Tucker-n decomposition ( ll24l ). For example, for a 
three mode tensor we can have the Tuckerl, Tucker2 and 
Tucker3 decomposition. In the following we discuss al- 
gorithms for the Tucker3 decomposition and briefly state 
some facts about Tucker2 and Tuckerl decompositions. 
Generalization to d modes is straightforward. 



(12) Q = X Xl A\ x 2 Al x 3 Aj 

In the case of noise in the data, one performs the al- 
ternating least squares algorithm. To solve the nonlinear 
optimization problem that tries to optimize the fit of the 
low rank approximation with respect to the original ten- 
sor, one converts the problem into a linear one, by "fix- 
ing" all modes but one and optimizing along that mode. 
This method is also known as Higher Order Orthogonal 
Iteration (HOOI). This procedure is continued until some 
stopping criterion is met, i.e., e improvement in terms of 
fit. 



Further Remarks Hastad proved that the tensor rank 
is an NP-complete problem 1201 . Lek-Heng Lim has pro- 
posed a theory for eigenvalues, eigenvectors, singular val- 
ues and singular vectors 1311 . Maximum constraint sat- 
isfaction problems (MAX-rCSP) have been casted as a 
tensor decomposition problem (sum of rank one compo- 
nents). In Q is proved that there is a PTAS (polynomial 
time approximation scheme) for a family of MAX-rCSP 
(i.e., core-dense). Sheehan and Saad in l42l give a unified 
view of different dimensionality reduction techniques un- 
der the tensor framework. A wealth of applications that 
use tensor decompositions exist, [26] contains a wealth of 
such references. 
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2.2 SVD and Fast Low Rank Approxima- 
tion 



,mxn can k e wr itt en as a sum f rank 

A = J2l=i ViUivJ , where u^i = 



Any matrix A G ] 
one matrices, i.e.. 
1 . . . r (left singular vectors) and Vi,i = 1 . . . r (right sin- 
gular vectors) are orthonormal and the singular values are 
ordered in decreasing order <j\ > . . . > a r > 0. Here r is 
the rank of A. We denote with Ak the fc-rank approxima- 
tion of A, i.e., Ak = X)i=i cr i u i w f - Among all matrices 
C G jj" i xn G f ran j c at mos t k, Ak is the one that mini- 
mizes | \A — C\ |f ([22]). Since the computational cost of 
the SVD is high, 0(min (m 2 n, n 2 m)) for the full SVD 
approximation algorithms that give a close to the optimal 
solution Ak have been developed. Frieze, Kannan and 
Vempala showed in a breakthrough paper ifTSll that an ap- 
proximate SVD can be computed by a randomly chosen 
submatrix of A. It is remarkable that the complexity does 
not depend at all on m,n. Their Monte-Carlo algorithm 
with probability at least 1 — 5 outputs a matrix A of rank 
at most k that satisfies the following equation: 

(13) \\A-A\\%<\\A-A k \\ 2 F + e\\A\\ F 

Drineas et al. in lfl2l showed how to find such a low rank 
approximation in 0(mk 2 ) time. A lot of work has fol- 
lowed on this problem. Here, we present the results of 
Achlioptas-McSherry [3] which are used in our worlfl 
The main theorem that is of interest to us is theorem|2] 

Theorem 2 (Achlioptas-McSherry [3j) Let A be any 

to x ft matrix where 76 < in < n and let b = maxy \Aij \. 
For p > (8 \ogn) A /n. Let A be a random m x n 
matrix whose entries are independently distributed, with 
Aij = Aij jp with probability p and with probability 
1 — p. Then with probability at least 7-exp(19(logn) 4 ), 
the matrix N = A — A satisfies the following two equa- 
tions: 



(14) 



(15) 



HMklla <46 




1 We call our proposed method MACH, to acknowledge the fact that 
it is based on the Achlioptas-McSherry work. 



Randomized Tensor Algorithms As already dis- 
cussed, the most computationally expensive step for the 
Tucker decomposition is the SVD part. To alleviate this 
cost, two randomized algorithms which select columns 
according to a biased probability distribution for tensor 
decompositions [141 have been proposed, extending the 
results of ifTTTl and lfl3l to the multiway case and Tensor- 
CUR El, the extension of the CUR method (33) in n- 
modes. Roughly speaking, the bounds proved are of the 
form [13] Another approach to approximating the Tucker 
decomposition for the case of a three-way tensor is pre- 
sented in l36l . The proposed method matricizes the tensor 
as in all aforementioned algorithms and employes appro- 
priately the matrix approximation described in IfTSl . 

3 Proposed Method 

The proof of theorem[T]follows: 

Proof 1 Let £ = X — X where X = X Xj 
A^A^ T ... x d A^A^ T . Without loss of generality, 
let's assume t of equation\2\is minimum for index d, the 
last mode. Observe first that matrix Pi = AW A^ T for 
i = 1, . . . , d is an orthogonal projector. Specifically, Pi 
projects on the subspace spanned by the top r, left singu- 
lar vectors of the i-th matricization of tensor X. Therefore 
we have the following: 

\\£\\ = \\X-Xx 1 A^AW T x 2 ...x d A^A^ T \\ = 
\\X - X x d A^A^ T + X x d A^A^ T — X Xi 
A^A^ T ...x d A^A^ T \\ < \\X~Xx d A^A^ T \\ + 
\\{X — X Xi A^A^ T ... x d _! Xd 
A< d UW T || < \\X-X x d A^A^ T \\ + \\X-Xx 1 



_i A^-^A^-^l 



We obtained the above inequality by adding and sub- 
tracting tensor X x d A^ d ' A^ d ' T and applying the trian- 
gle inequality for a norm. The last line was obtained by 
using the fact that A^ d 'A^ T is a projector thus we can 
only reduce the norm if we project the d-th matricization 
oftensorX- X x 1 A^A^ T . . . x d _i A^" 1 ) A^~ X ) T 
along the d-th mode. 

Now consider the term \\X - X x d A^ A (rf)T | |. 
If we matricize this tensor along the d-th mode the 
Frobenius norm remains unchanged. Therefore \\X — 
X x d A^A^ T \\ = IIX 



(<>) 



X 



(d),r d \ 



Now we 



use the following inequality to further bound this resid- 



es 



Principal Component First Principal Component, First Principal Component 

"Machin e id" Mode (Tucker3) Time Aspect (Exact) Time Aspect (MACH} 




(a) (b) (c) 

Figure 3: (a) Top approximate Principal Component (PC) of the "machine-id" mode using the sampling MACH 
method vs. the exact PC. The PC was computed using a Tucker3 decomposition of the three-way tensor machine 
id x type of measurement x timeticks, formulated by data from the CMU monitoring system lETTl . MACH used 
approximately 10% of the original data. Pearson's correlation coefficient is shown in the inset, and is almost equal to 
the ideal value 1. Such PCs are of high practical value since they are used in outlier detection algorithms lETl l37l l47l . 
(b) Exact PC for the time aspect (c) Approximate PC using MACH. Pearson's correlation coefficient for the two time 
series equals 0.9772, again close to the ideal value 1. 



ual norm: \\X - X x d A^A^ T \\ = \\X (d) - 

x {d) , rd \\ < \\x {d} - x (dhrd \\ + 4by/%nti\i k + 
m\x {dh r d \\)H r futU^. 

The last inequality follows by combining two argu- 
ments which appear in H3j. Namely, for any matrices A 
andB, the following holds: \\A — Bk\\p < \\A — ^4fc 1 1 _f + 
2^\\(A- B) k \\ F \\A k \\ F + | \(A- B)k | \f Now substitut- 
ing for A the matrix X^ and for B k the matrix Xi d \ rd 
and using equation\T5\to upper-bound \ \{A — B) k \\ = 
\\(X — ^0r d || gives the last inequality, where k = r d in 
our case. Observe that we can use equation\T5\since the 
assumptions of Theorem\2\hold by our assumptions. 

Now consider the term \\X — X x i A^ A^ T . . . x d _i 
j^id-- 1 ) J ^( d ~ 1 ) T ^ |_ w e w ni recursively apply simple prop- 
erties of a norm and of a projector. Specifically: 

\\X - X xj A^A^ T . .. x d _ x A^A^-^W = 
WX-Xx^A^A^-^+Xx^A^A^- 1 ^- 
X x 1 A^A^ T ... x d _! A^A^-^W < 

\\x - x x d _i a^-va^-^w + \\(X - 

X Xl A^A^ T ... x d _ 2 A( d -V A( d -^ T ) x d _! 
A (d-i) A (d-i)T\\ < \\x-X x d _x AV-QaV-WW + 
\\X - X xj A^A^ T . . . x d _ 2 A^A^ T \\. 

Again we used the triangle inequality plus the fact that 
we can only reduce the norm if we project. Now repeating 



the same procedure to the last term and observing that for 
term kfor k=l,..,d-l the norm does not change if we ma- 
tricize with respect to that mode, we obtain the following 
simple upper bound: 

\\X - X xi A«A« T . . . x d _! A^A^-^W < 

Z)fc=l \ \Xk - Xk,r k \\ 

By combining the above results we get the desired in- 
equality. Three final remarks: observe that b is the maxi- 
mum of any matricization of our tensor and it is clear that 
since the above procedure gives for each aspect i an in- 
equality of the form \\X — X\\ < ti then \\X — X\\ < 
mini U- Finally the probability of success follows as the 
product of the success probabilities along each mode i. 

Remarks (1) TheoremQ] suggests algorithm 1, MACH- 
HOSVD. The algorithm takes as input a tensor X £ 
M. !l x - xl d and a vector containing the desired ranks of 
approximation along each mode (R\, . . . , R d ). MACH 
tosses a coin for each non-zero entry Xi lt .. i d of the 
tensor with probability p of keeping it and 1 — p for 
zeroing it. In case of keeping it, we reweigh it, i.e., 
Xi!,...,i d <— ■ Then we return as an approxima- 

tion to the HOS VD of tensor X the HOSVD of tensor X. 
The key idea behind proposing this algorithm is that for 
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First Principal Component (Exact) 



First Principal Component (MACH) 





Measurement ID 



Measurement ID 



Figure 4: Principal component for the "type of measurement" aspect for the Intel Lab Berkeley sensor network 
|[T0l . Ids 1 to 4 correspond to voltage, humidity, temperature and light intensity. As we observe, the PC captures the 
correlations between those types and MACH succeeds with p=0. 1 in preserving them accurately. 



any matricization along mode fc of tensor X we get that: 



= \\X 



(A:) 



X 



(k),r k 



< 



— n 



\\X - X x fe A^A^ T 

\\ X (k) - X (k),r k \\ 

Intuitively if tensor X has a good (ri, . . . , r<j) Tucker 
approximation, then matricization along mode fc has a 
good rk rank approximation. The sparsification allows 
us to approximate this low rank approximation X(k),r k by 

X (k),r k - 

(2) Frequently small r/s result in a satisfactory approx- 
imation of the original tensor. The sparsification process 
we propose due to its simplicity is easily parallelizable 
and can easily be adapted to the streaming case ETl by 
tossing a coin each time a new measurement arrives. (3) 
Picking the optimal p in a real world application can be 
hard, especially in the context we are interested in, i.e., 
monitoring systems, where data is constantly arriving. 
Another potential problem are the assumptions of the the- 
orem which may be violated. Fortunately, this does not 
render MACH algorithm useless. On the contrary, picking 
a constant p even for small tensors which do not satisfy 
the conditions of the theorem result turns out to be accu- 
rate enough to perform data analysis. Therefore, a practi- 
cioner in whose application constant factor speedups and 
space savings are significant can just choose a constant 



p. (4) The expected speedup depends on the "under-the- 
hood" method to find the top k singular vectors of a ma- 
trix. Lanczos method [ 17 | is such a method. Recently, ap- 
proximation algorithms approximate the fc-rank approxi- 
mation of a matrix in linear time l40l . Thus, if such a fast 
algorithm is used, the expected speedup is i. (5) Theo- 
rem Q] refers to the HOSVD of a tensor. We can apply 
the same idea to the HOOI. This results in algorithm 2. 
We do not analyze the performance of algorithm 2 here, 
since it would require the analysis of the convergence of 
the alternating least squares method which does not ex- 
ist yet. As we will show in the experimental section [4] 
MACH-HOOI gives satisfactory results. 



4 Experiments 

Experimental Setup We used the Tensor Toolbox 
lH, which contains MATLAB implementations of the 
HOSVD and the HOOI. Our experiments ran in a 2GB 
RAM, Intel(R) Core(TM)2 Duo CPU at 2.4GHz Ubuntu 
Linux machine. Table [3] describes the datasets we use. 
The motivation of our method as already mentioned, is to 
provide a practical algorithm for tensor decompositions 
which involve streams, such as monitoring systems. It 
is also worth noting that the assumptions of theorem Q] 
do not hold. Nonetheless, results are close to ideal. Fi- 
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Algorithm 1 MACH-HOSVD 



Require: X eR hx - XId 
Require: (n,.. . ,r d ) 
Require: p 

{MACH} 

for each X ix _ id , ij = 1 . . . L toss a coin with proba- 
bility p of keeping it. 
if success then 

y i 'i *d 



else 







end if{HOSVD} 
for i = 1 to d do 

A^ 1 ' <— ri leading left singular vectors of 
end for 

g^X x 1 A^ T x 2 A^ T ... x d A^ T 
return g, A^ , . . . , A^ 



Algorithm 2 MACH-HOOI 



Require: X e 
Require: (n , . 
Require: p 

{MACH} 
for each Xi, 



lX...Xl d 

,Td) 



bility p of keeping it. 
if success then 



1 . . . Ij toss a coin with proba- 







Xii,...,id 

else 

end if 

{HOOI} 

initialize e xrfc for fc = 1 . . . d using HOSVD 
repeat 
for i — 1 to d do 



y 

A (i+ 



1)T 



X Xl A« T . 



^(i-l)T v 
-1 A y ' X i+ i 



AW <— rj leading left singular vectors of 17 j) 
end for 

until fit stops improving or maximum number of itera- 
tions is reached 

<- * xi A« T x 2 ^( 2 ) T . . . x d A^ T 
return 0, • ■ • , 



Principal Component 
"Time" Mode (Tucker3) 



0.08 
0.07 
0.06 

a) 

•i 0.05 
go 

0.04 
Q. 

0.03 
0.02 
0.01 

0, 




Timetick 



Figure 5: Principal component for the time aspect using 
MACH with p=0.1. Daily periodicity appears to be the 
dominant latent factor for the time aspect. 



nally, in this section we report experimental results for 
the MACH-HOOI. The reason is that Tucker decomposi- 
tions using alternating least squares are used in practice 
more than the HOSVD and also, they have already been 
successfully applied to the real world problems we con- 
sider in the following l47ll . The results for HOSVD are 
consistently same or better than the results we report in 
this section. 



name 


h x h X ^3 


Sensor 
Network Data ( iflOll ) 


54-by-4-by 5385 


Intemon 
Data ( [EH ) 


100-by-12-by- 10080 



Table 3: Dataset summary. The third aspect is the time 
aspect. 



4.1 Monitoring computer networks 

As already mentioned in Section Q] a prototype monitor- 
ing system in Carnegie Mellon University uses data min- 
ing techniques successfully |37l |2T1 |47l to spot anoma- 
lies and detect correlations among different types of mea- 
surements and machines. Analyzing and applying these 
techniques on large amounts of data however is a chal- 
lenge. A natural way to model this type of data is a three- 
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(a) SENSOR Concept 1 (b) SENSOR Concept 1 using MACH 



Figure 6: (a) shows the distribution of the most dominant trend, (b) shows the distribution of the most dominant trend, 
using MACH algorithm with p=0.1. Pearson's correlation coefficient equals 0.93, and thus the qualitative analysis 
of the dominant sensor/spatial correlations remains unaffected by the sparsification. Colored bars indicate positive 
weights of the corresponding sensors. As suggested in fiTll . e values assigned to the sensors are more or less uniform 
suggesting that the dominant trend is strongly affected by the daily periodicity. 



way tensor, i.e., machine id x type of measurement x time. 
The data on which we apply MACH is a tensor X G 
jjioox 12x10080^ Tne first aspect j s t h e "machine id" as- 
pect and the second is the "type of measurement" as- 
pect (bytes received, unicast packets received, bytes sent, 
unicast packets sent, unprivileged CPU utilization, other 
CPU utilization, privileged CPU utilization, CPU idle 
time, available memory, number of users, number of pro- 
cesses and disk usage). The third aspect is the time aspect. 
Figure [3 a) plots the Principal Component (PC) of the 
"machine id" aspect after performing a Tucker3 decom- 
position using MACH versus the exact PC. Our sampling 
approach thus kept approximately the 10% of the original 
data. As the figure shows, the results are close to ideal 
and similar results hold for the other few top PCs. Specif- 
ically, Pearson's correlation coefficient is 0.99, close to 
the ideal 1 which is the perfect linear correlation between 
the exact and the approximate top PC. This fact is impor- 
tant since these PCs can be used to find outlier machines, 



which ideally would be the machines that face a function- 
ality problem. Figures 0b), 0c) show the exact top and 
the MACH PC for the time aspect. Pearson's correlation 
coefficient is equal to 0.98. We observe that there is no 
clear periodic pattern in this time series. The important 
fact is that MACH using only 10% of the data, results in a 
good approximation. This is of significant practical value 
and can be used also in conjunction with DTA [46] to per- 
form dynamic tensor analysis in larger time windows. 

4.2 Environmental Monitoring 

In this application we use data from the Intel Berkeley 
Research Lab sensor network [10|. The data is collected 
from 54 Mica2Dot sensors which measure at every timet- 
ick humidity, temperature, light and voltage. 

It has been shown in ll47l that tensor decompositions 
along with a wavelet analysis can efficiently capture 
anomalies in the network, i.e., battery outage as well as 
spatial and measurement correlations. In this section we 
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show that a random subset about 10% of the initial data 
volume suffices to perform the same analysis as if we had 
used the whole tensor. 

Figure [4] shows the correlations revealed by the the 
principal component for the "type of measurement" as- 
pect. As we observe, voltage, temperature and light in- 
tensity are positively correlated, whereas at the same time 
the latter types of measurement are negatively correlated 
with humidity. This is because during the day, tempera- 
ture and light intensity go up but humidity drops because 
the air conditioning system is on. Similarly , during the 
night, temperature and light intensity go down but humid- 
ity increases because the air conditioning system is off. 
Furthermore, the positive correlation between voltage and 
temperature is due to the design of MICA2 sensors. As we 
observe again, MACH gives the same qualitative analysis 
by examining the principal component. Pearson's corre- 
lation coefficient is close to the ideal value 1. Figure 5 
shows the principal component for the time aspect. A 
periodic pattern is apparent and corresponds to the daily 
periodicity. Performing a Tucker2 decomposition as sug- 
gested by 1471 and plotting the fiber of the core tensor 
corresponding to the principal components of the tensor 
for the "sensor id" and "measurement type" mode, the re- 
sults are again close to ideal. Figure[6{a) shows the prin- 
cipal component for the "sensor id" aspect using the exact 
Tucker decomposition and Figure[6fb) using MACH with 
p=0. 1 . The top component captures spatial correlations 
and MACH preserves them with a random subset of size 
approximately 10% of the original data. Pearson's corre- 
lation coefficient is equal to 0.93. 

4.3 Discussion 

General The above experiments show MACH results in 
a good approximation of the desired low rank Tucker ap- 
proximation of a tensor. Similar result hold for the other 
few top principal components of the Tucker decomposi- 
tion. Also, as already mentioned, results for HOSVD are 
consistently better or same with the reported ones, and 
the above applications were selected since it has already 
been shown by previous work that Tucker decompositions 
and S VD can detect anomalies and correlations. Thus, the 
main goal of this section is -rather than introducing new 
applications- to show that keeping a small random subset 
of the tensor can give good results. 



How to choose p? Choosing the best possible p is an is- 
sue. We use a constant p, i.e., p= 1 0% in our experiment^ 
Constant p's are of significant practical value in such set- 
tings where it is not clear how one should set p to sparsify 
the underlying tensor optimally. For "post-mortem" data 
analysis, one can try setting lower values for p according 
to theorem Q] 

Speedups & Synthetic Experiment Speedups due to 
the small size of the two datasets and the implemen- 
tation was less than the expected 10 x (typically 2-3 x 
faster performance). However, as the size of the tensor 
grows bigger (i.e., the number of non-zeros) the speedup 
should become apparent. For example consider a tensor 
X G K nxnxn , with Xijk = i + j + k and assume we want 
an (r,r,r) approximation. As shown in [16; 50] for an 
approximation with error e the rank grows logarithmically 
with n and e, satisfying inequality [TBI 

(16) r < C(lognlog 2 e) 

This tensor appears in numerical solutions of integral 
equations J36). A small numerical example for r = 4 and 
n = 200 gives the results in table [4] for p = 0.1. The 
second column of the table contains a vector of three val- 
ues (pi, P2, pz)- Pi i=l,2,3 is the correlation coefficient 
between the principal component of the exact Tucker3 
decomposition and the MACH Tucker3 decomposition 
of aspect i. As we see the correlation is almost perfect 
for all aspects. This is significant since the single im- 
portant interaction is betwen the first principal compo- 
nents. This can be seen by examining the core tenso^ 
The third column contains the accuracy of the approxima- 

tion, i.e., 1- - 1 -. As we 

see the speedup now becomes apparent, i.e., 7.52 x faster. 
Finally, when we attempt to run Tucker3 on a larger ten- 
sor with n=500, MATLAB runs out of memory, whereas 
when using p=0. 1 we can run the Tucker decomposition 

2 For both applications that value of p, gives excellent results. If we 
set p=5% for the first application results get significantly worse whereas 
for the second results remain good. 

3 The exact core tensor value which determines the interaction be- 
tween the top PCs is g(l, 1, 1) = 18.4856 and 18.4887 for the MACH 
decomposition. The next largest core tensor value has absolute value 
2.61«18.5. 
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and obtain an accurate precision. Observe that for this 
specific value of n the assumptions of theorem Q] do not 
hold, i.e., a = 200 2 thus p > = @2§*f£ » 

1. However results are satisfactory and this holds for even 
smaller values of p as one can verify. 



p 


P 


accuracy 


0.1 


(0.9967, 0.9955, 
0.9964) 


87% 


exact (sec) 


MACH(sec) 


speedup ( x faster) 


119.8 


15.92 


7.52 



Table 4: Synthetic experiment results 



5 Conclusions 

In this paper we focused on Tucker decompositions. We 
proposed MACH, a simple randomized algorithm which 
is embarassingly parallel and adapts easily to tensor 
streams, since it simply tosses a coin for each entry of 
the tensor. Specifically, our contributions include: 

• A new algorithm MACH, which keeps a small per- 
centage of the entries of a tensor, and still produces 
an accurate low rank approximation of the tensor. 
We performed a theoretical analysis of the algorithm 
in Theorem[T]and of its speedup in Section[3] 

• An experimental evaluation of MACH on two real 
world datasets, both generated from monitoring sys- 
tems, and on a synthetic one, where we showed that 
for constant values of p excellent performance. 

This algorithm will be incorporated in the PEGASUS 
software library (231 , a graph and tensor mining system for 
handling large amounts of data using Hadoop, the open 
source version of MapReduce (5J- Given the effective- 
ness of the sampling approach and that it is embarass- 
ingly parallel, it will be useful when dealing with huge 
amounts of data, given of course that the empirical ob- 
servation that low rank approximations are satisfactory in 
practice. The (in)effectiveness of MACH with respect to 
the PARAFAC/CANDECOMP decomposition will be in- 
vestigated in future work. 
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